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Abstract 

Metagenomics is a powerful approach to study genetic content of environmental 
samples, which has been strongly promoted by NGS technologies. To cope with 
massive data involved in modern metagenomic projects, recent tools [21 144) rely on 
the analysis of fc-mers shared between the read to be classified and sampled reference 
genomes. Within this general framework, we show in this work that spaced seeds 
provide a significant improvement of classification accuracy as opposed to traditional 
eontiguous fc-mers. We support this thesis through a series a different computational 
experiments, including simulations of large-scale metagenomic projects. 

Scripts and programs used in this study, as well as supplementary material, are 
available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics 


1 Introduction 

Metagenomics is a powerful approach to study genetic material contained in environmen¬ 
tal samples. The advent of high-throughput sequencing technologies {Next-Generation 
Sequencing, commonly abbreviated to NGS) revolutionized this approach, by avoiding 
the need of cloning the DNA and thus greatly facilitating the obtention of metagenomic 
samples, at the same time drastically decreasing its price. Early examples of metage¬ 
nomic projects include the analysis of samples of seawater gntts], human gut [36] , or soil 
|43j . Present-day metagenomic studies focus on various bacterial, fungal or viral popu¬ 
lations, exemplified by the Human Microbiome project |35| that investigates microbial 
communities at different sites of human body. 

Modern metagenomics deals with vast sequence datasets. On the one hand, metage¬ 
nomic samples {metagenomes) obtained through NGS are commonly measured by tens 
or even hundreds of billions of bp m- These sequences generally come from a number 
of different species, some of which either have a previously sequenced reference genome, 
or have a related sequenced species sufficiently close phylogenetically to determine this 
relationship by sequence comparison. Other sequences, however, may come from or¬ 
ganisms that have no sufficiently close relatives with sequenced genomes, or from DNA 
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fragments that show no significant similarity with any available genomic sequence. The 
metagenomic classification problem is to assign each sequence of the metagenome to a 
corresponding taxonomic unit, or to classify it as ’novel’. 

A way to improve the accuracy of metagenomic classification is to match the metage¬ 
nome against as large set of known genomic sequences as possible. With many thousands 
of completed microbial genomes available today, modern metagenomic projects match 
their samples against genomic databases of tens of billions of bp [l^ . 

Alignment-based classifiers [26] proceed by aligning metagenome sequences to each of 
the known genomes from the reference database, in order to use the best alignment score 
as an estimator of the phylogenetic “closeness” between the sequence and the genome. 
This could be done with generic alignment program, such as Blast [2| , Blastz |37| , or 
Blat m - While this approach can be envisaged for small datasets (both metagenome 
and database) and is actually used in such software tools as Megan [T1| or PhymmBL 
|8| (see |26| for more), it is unfeasible on the scale of modern metagenomic projects. 
On the other hand, there exists a multitude of specialized tools for aligning NGS reads 
- BWA [22], Novoalign (http://www.novocraft.coni/), GEM |28j . Bowtie [20] . 
just to mention a few popular ones - which perform alignment at a higher speed and 
are adjusted to specificities of NGS-produced sequences. Still, aligning multimillion read 
sets against thousands of genomes remains computationally difficult even with optimized 
tools. Furthermore, read alignment algorithms are usually designed to compute high- 
scoring alignments only, and are often unable to report low-quality alignments. As a 
result, a large fraction of reads may remain unmapped |24| . 

Several techniques exist to reduce the computational complexity of this approach. 
One direction is to pre-process the metagenomic sample in order to assemble reads 
into longer contigs, potentially improving the accuracy of assignment. Assembly of 
metagenomic reads has been a subject of many works (see [SU]) and remains a fragile 
approach, due to its error-proneness and high computational complexity. Overall, it 
appears feasible mostly for small-size projects with relatively high read coverage. 

To cope with increasingly large metagenomic projects, alignment-free methods have 
recently come into use. Those methods do not compute read alignments, thus do not 
come with benefits of these, such as gene identification. Alignment-free sequence compar¬ 
ison is in itself an established research area, reviewed in a recent dedicated special issue 
|42| . Most of alignment-free comparison methods are based on the analysis of words, 
usually of fixed size (fe-mers), occurring in input sequences. A popular approach is to 
compute the distance between frequency vectors of all /c-mers in each of the sequences. In 
the context of metagenomics, however, when one of the sequences is short (NGS read), 
the analysis is based on the shared /c-mers, without taking into account their multiplici¬ 
ties in reference genomes. This is also dictated by the prohibitive computational cost of 
computing and storing /c-mer multiplicities for metagenomics-size data. 

Two recently released tools - LMAT [3] and Kraken [H] - perform metagenomic 
classification of NGS reads based on the analysis of shared /c-mers between an input 
read and each genome from a pre-compiled database. Given a taxonomic tree involving 
the species of the database, those tools “map” each read to a node of the tree, thus 
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reporting the most specific taxon or clade the read is associated with. Mapping is done 
by sliding through all /c-mers occurring in the read and determining, for each of them, 
the genomes of the database containing the A:-mer. Based on obtained counts and tree 
topology, algorithms [3111 assign the read to the tree node “best explaining” the counts. 
Further similar tools have been published during last months ISllIS]. 

The goal of this work is to show that the metagenomics classification based on the 
analysis of shared k-mers can be improved by using spaced k-mers rather than contiguous 
k-meis. 

The idea of using spaced /c-mers goes back to the concept of spaced seeds for seed-and- 
extend sequence comparison jssiini. There, the idea is to use as a seed (i.e. local match 
triggering an alignment) a sequence of matches interleaved with “joker positions” holding 
either matches or mismatches. The pattern specifying the sequence of matches and 
jokers is called the spaced seed. Remarkably, using spaced seeds instead of contiguous 
seeds significantly improves the sensibility-selectivity trade-off with almost no incurred 
computational overhead. This has been first observed in [21 and then extended and 
formally analysed in a series of further works, see [niiH] and references therein. 

Recently, it has been reported in several works that spaced seeds bring an improve¬ 
ment in alignment-free comparison as well. In |21) . it is shown that comparing frequency 
vectors of spaced k-mers (fc-mers obtained by sampling must-match positions of one or 
several spaced seeds), as opposed to contiguous fc-mers, leads to a more accurate estima¬ 
tion of phylogenetic distances and, as a consequence, to a more accurate reconstruction 
of phylogenetic trees. In [29], the authors studied another measure - the number of 
pairs of matching (not necessarily aligned) spaced fe-mers between the input sequences 
- and showed that it provides an even better estimator of the phylogenetic distance. 
In [32], it is shown that the number of hits of appropriately chosen spaced seeds in 
aligned sequences and their coverage (i.e. the total number of matched positions cov¬ 
ered by all hits) provides a much better estimator for the alignment distance than the 
same measures made with contiguous seeds. From a machine learning perspective, works 
[Ml [12] show that spaced seeds provide better string kernels for SVM-based sequence 
classification, confirmed by experiments with protein classification (see also |32ji. 

In this work, we show that using spaced fe-mers significantly improves the accuracy 
of metagenomic classification of NGS reads as well. To support this thesis, we consider 
different scenarios. As a test case, we first study the problem of discriminating a read 
between two genomes, i.e. determining which of the two genomes is “phylogenetically 
closer” to the read. We then analyse the correlation between the quality of an alignment 
of a read to a genome and the seed count for this read, defined either as the number of 
hits or as the coverage. This analysis provides an insight into how well one can estimate 
the similarity between a read and a genome out of /c-mer occurrences. Finally, we make a 
series of large-scale metagenomic classification experiments with Kraken software [H] 
extended by the possibility of dealing with spaced seeds. These experiments demonstrate 
an improved classification accuracy at the genus and family levels caused by the use of 
spaced seeds instead of contiguous ones. 
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2 Preliminaries 


A spaced seed is a binary pattern over symbols # and - denoting match and joker 
respectively. For example, seed #-## specifies a match followed by either match or 
mismatch followed by two consecutive matches. A seed acts as a mask for comparing 
short oligonucleotides, for example sequences gaat and gcat differ at the 2nd position, 
but they match via seed #-## as the 2nd position is masked out. The number of #’s in a 
seed, called weight, defines the number k of matching nucleotides. In the above example, 
k = 3 and the matching (spaced) k-mei is gat. In a slightly different terms, a seed is a 
pattern that specifies a small part of a gapless alignment seen as a binary sequence of 
matches and mismatches. For example, seed #-## occurs in (or hits, as usually said) any 
alignment containing a match followed by another two matches shifted by one position. 

When spaced seeds are used for sequence alignment within the seed-and-extend 
paradigm [9], a pair of matching /c-mers (or, sometimes, a matching sequence of sev¬ 
eral closely located /c-mers) indicates a potential alignment of interest in which the two 
/c-mers are aligned together. When spaced seeds are used for alignment-free sequence 
comparison [ 32 ], the goal is to estimate the similarity of two sequences based on the 
number of matching /c-mers, with no or limited information about their positions in one 
or both sequences. This measure can be formalized in several different ways. 

In the context of metagenomic classification of NGS reads, the goal is to estimate 
the evolutionary distance between a short read and a long sequence (genome), which can 
be modeled by the best alignment score of the read against the sequence. Due to large 
genome size and large number of reference genomes involved in computations, one of 
the constraints is to avoid keeping track of positions of fc-mers in genomes. We can only 
afford constructing an index to quickly answer queries whether or not a /c-mer occurs 
in a given genome, without information on its positions. On the other hand, /c-mer 
positions in the query read can be included to the analysis. Therefore, we have to derive 
our estimation from the number of /c-mers shared between the read and the genome, 
together with their positions in the read alone. 

One simple estimator is the number of /c-mers in the read that occur in the target 
genom^ we call this measure the hit number. However, one may want to favor cases 
when matching fc-mers cover a larger part of the read, vs. those with matching /c-mers 
clumped together due to overlaps. This leads to the concept of coverage |32|, earlier used 
in seed-based alignment algorithms as well miEi. The coverage is the total number of 
positions covered by all matching /c-mers. For example, consider seed #-## and read 
gaatcagat. Assume the seed hits at positions 1,4,6, i.e. /c-mers g-at, t-ag and a-at 
occur in the target genome (joker symbol is shown for the sake of clarity). Here, the 
hit number is 3, and the coverage is 7 as seven positions are covered by hits, namely 
positions 1,3,4,6,7,8,9. Hit number and coverage are two estimators studied in this work. 

^Here identical fc-mers occurring at distinct positions in the read are considered distinct. 
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3 Results 

3.1 Binary classification 

From the machine learning viewpoint, hit number and coverage can be viewed as in¬ 
stances of kernel functions for sequence data (see e.g. IS]). Our first step was to compare 
their capacity w.r.t. a simple binary classification task. Assume the (impractical) case 
when our database contains only two genomes. Given a read, we have to decide which 
of the two genomes the read is closer to. How good can we be at that? Which kernel 
works better for this task? And are spaced seeds better than contiguous seeds here? 


3.1.1 Classifying aligned reads 


As mentioned in Introduction, the “distance” of a read to a genome translates naturally 
to the score of the best alignment of the read to the genome. Given two genomes, we 
want to tell which of them is closer to a given read. 

Gonsider alignments Ag, Ai of a random read to the two genomes, and assume they 
are gapless and have mismatch probabilities Ps,Pi respectively, with pg < pi. Throughout 
this paper the read length is set to 100, a typical length of Illumina read. Therefore, 
the alignments can be thought of as random binary strings of length 100 of matches 
and mismatches, with mismatch probabilities pg and pi respectively. Given a seed, we 
compute two counts Cg and Ci on alignments Ag and Ai respectively, where by ’count’ we 
mean, unless otherwise stated, either the hit number or the coverage. For example, if the 
seed is #-## and the alignment 111101111 (1 stands for match and 0 for mismatch), then 
the hit number is 3 and the coverage is 7. Note that in this model, common (spaced) 
/c-mers are assumed to occur necessarily at the same position in the read alignment, 
although in reality, a /c-mer of the read may not be aligned with the same fc-mer in the 
genome. However, in this first experiment, we abstract from this fact. 

If Cg > Cl (resp. Cg < Ci), then we report a correct (resp. incorrect) classification, 
otherwise a tie is reported. By iterating this computation, we estimate the probability 
of correct/incorrect classification for each parameter set. 

The results are presented in Tables |la|lb|lc] In all cases, spaced seeds show a better 
classification power. In some cases, the difference is striking: for example, if we want 
to discriminate between alignments with mismatch probabilities 0.1 and 0.2 (Table [Tb| ) 
using seeds of weight 22, then a spaced seed yields 86% of correct classifications (coverage 
count), whereas the contiguous seed correctly classifies only 65% of cases, whereas the 
fraction of incorrect classifications is essentially the same. The results also show a slight 
edge of the coverage count over the hit number, which suggests the superiority of the 
latter that will be confirmed later on in other experiments. 


3.1.2 Classifying unaligned reads 

Let us now turn to a more practical setting, where we want to classify reads coming from 
a genome G between two other genomes Gi and G 2 based on the phylogenetic closeness. 
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(a) ps = 0.05, Pi = 0.1 



contiguous 

spaced 

weight 

Hits 

Cover 

Hits 

Cover 

16 

.85/.14 

.85/.14 

.87/.11 

.88/.11 

18 

.84/.14 

.85/.14 

.87/.12 

.88/.11 

20 

.83/.15 

.84/.15 

.87/.12 

.87/.12 

22 

.82/.16 

.83/.15 

.86/.12 

.87/.12 

24 

.80/.16 

.81/.15 

.85/.12 

.87/.12 


(b) Ps = 0.1, Pi = 0.2 



contiguous 

spaced 

weight 

Hits 

Cover 

Hits 

Cover 

16 

.86/.09 

.87/.09 

.93/.05 

.94/.05 

18 

.81/.08 

.81/.08 

.91/.05 

.92/.05 

20 

.74/.07 

.74/.07 

.89/.05 

.90/.05 

22 

.65/.06 

.65/.06 

.85/.05 

.86/.05 

24 

.55/.04 

.56/.04 

.79/.04 

.81/.05 


(c) Ps = 0.2, Pi = 0.3 



contiguous 

spaced 

weight 

Hits 

Cover 

Hits 

Cover 

16 

.40/.06 

.40/.06 

.63/.07 

.64/.07 

18 

.28/.04 

.28/.03 

.50/.05 

.50/.05 

20 

.18/.02 

.18/.02 

.37/.03 

.37/.03 

22 

.12/.01 

.12/.01 

.25/.02 

.26/.02 

24 

.08/.00 

.08/.00 

.17/.01 

.18/.01 


Table 1: Each entry contains a pair “Probability of correct Classifica¬ 
tion/Probability of incorrect classification”. The remaining fraction es¬ 
timates the probability of a tie. Spaced seeds used for hit number: 

###-##-#-#—#-#-#—#—##### (weight 16), ####-##—##—#—#-#-#-##-#### (weight 18), 

#####-###—##—#—#-#-#-##-#### (weight 20), ######—####-#-##-#-#—####### 

(weight 22), #######—####-#—##-#-#—######## (weight 24). Spaced seeds used for 

coverage: ###-###—#-#—#-##—##### (weight 16), ###-#—###-##—#-#—###-#### (weight 
18), ###-#-##-#-##—##—###-#-##### (weight 20), 

(weight 22), (weight 24). 
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To study this, we implemented the following experimental setup. Using DWGSIM read 
simulator (https://github.com/nhl3/DWGSIM), we generate single-end iLLUMlNA-like 
reads from genome G. In all experiments, we assumed 1% of base mutations (sub¬ 
stitutions only), and 2% of sequencing errors (dwgsim options -e 0.02 -r 0.01 -R 
0). Given a seed, (contiguous or spaced) /c-mers of Gi and G 2 are indexed to support 
existence queries only. For each read, all fe-mers are queried against Gi and G 2 and 
corresponding counts Gi and G 2 are computed. If Gi > G 2 (resp. Gi < C 2 ), the read is 
classified to be closer to Gi (resp. G 2 ), otherwise a tie is reported. Besides considering 
absolute counts (hit number and coverage), we also considered hit number normalized 
by the number of distinct A:-mers in the corresponding genome (computed at the index¬ 
ing stage). This measure approximates the Jaccard index [5] and reflects the Bayesian 
probability of seeing a /c-mer relative to the “/s-mer-richness” of genomes. 

Note that the counts are here computed relative to the whole genome, as it is done 
in the approach of (see Introduction). This means that the /c-mers occurring in 

the read are looked up in the whole genome, without guarantee, however, that these 
/c-mers are closely located in the genome and contribute to the same read alignment. 
This makes the seed weight an important parameter, as seeds of low weight may result 
in a high read count which does not evidence any alignment of the read, due to random 
character of the /c-mer hits. 

We experimented with bacterial genomes belonging to Mycobacterium genus. Mem¬ 
bers of this genus present low interspecies genetic variability and their phylogeny remains 
uncertain m- 

If G coincides with one of Gi, G 2 , i.e. reads have to be classified between its source 
genome and another genome, then all estimators correctly classify nearly all reads as soon 
as Gi and G 2 are genomes of distinct species. For example, classifying reads obtained 
from Mycobacterium tuberculosis (H37Rv, acc. NC_018143) against M. tuberculosis itseli 
and M.avium (104, NC_008595) led to more than 99% of correct classifications for all 
estimators. 

The case when G is distinct from Gi,G 2 appears more interesting. It corresponds 
to the real-life situation when reads to be classified can come from a genome that is not 
represented in the database. Here we expect our procedure to determine whether G is 
phylogenetically closer to Gi or to G 2 . 

For example, we classified M.vanbaalenii (PYR-1, NC_008726) reads against M.smeg- 
matis (MC2 155, NC_018289) and M.gilvum (PYR-GCK, NG_009338) genomes. Alter¬ 
native phytogenies given in [401 Fig.1-4] imply different evolutionary relationships among 
these three species. Our results, shown in Table suggest that M.vanbaalenii is closer 
to M.gilvum than to M.smegmatis. For non-normalized hit number and coverage estima¬ 
tors, this conclusion is supported by seeds of weight 16 or more, while weight 14 supports 
the opposite conclusion. This is due to spurious hits that become dominating when the 
weight drops to 14, and to the larger size of M.smegmatis genome (6.99Mbp) compared 
to M.gilvum (5.62Mbp). This effect is corrected by Jaccard index due to normalization 
by the number of distinct A:-mers (6.09M for M.smegmatis vs. 4.96M for M.gilvum for 
the spaced seed of weight 14). Overall, we observe a significantly sharper discrimination 
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Table 2: Classification of Mycobacterium vanbaalenii reads against Mycobacterium smeg- 
matis and Mycobacterium gilvum genomes. Each entry contains a pair “Fraction (in %) of 
reads classified closer to M.smegmatis / Fraction of reads classified closer to M.gilvurri\ 



weight 


14 

16 

18 

20 

22 

contig hit nb 

52/41 

39/48 

24/37 

11/24 

07/17 

contig cover 

54/42 

44/47 

25/37 

11/24 

06/17 

contig Jaccard 

30/70 

35/61 

23/42 

11/26 

06/18 

spaced hit nb 

51/40 

34/47 

20/40 

12/32 

08/23 

spaced cover 

53/42 

39/51 

21/42 

12/32 

08/25 

spaced Jaccard 

28/72 

32/66 

20/50 

11/33 

08/27 


Table 3; Classification of Bacillus thuringiensis reads against Bacillus anthracis and 
Bacillus cereus genomes. Each entry contains a pair “Fraction (in %) of reads classified 
closer to B. anthracis / Fraction of reads classified closer to B. cereus ”. 



weight 


14 

16 

18 

20 

22 

contig hit nb 

83/14 

81/11 

79/09 

77/08 

76/08 

contig cover 

78/17 

80/12 

79/09 

77/08 

76/08 

contig Jaccard 

87/13 

87/11 

85/09 

83/08 

82/08 

spaced hit nb 

83/13 

82/11 

80/09 

79/09 

79/08 

spaced cover 

80/15 

81/11 

80/09 

79/09 

79/08 

spaced Jaccard 

88/12 

88/11 

85/09 

84/08 

84/08 


produced by spaced seeds compared to contiguous seeds. 

We also performed a series of experiments with the large and genetically variable 
Bacillus genus. Table shows a demonstrative experiment with members of Bacil¬ 
lus cereus group: B. thuringiensis (serovar konkukian 97-27, NC_005957), B. anthracis 
(Ames, NC_003997), and B.cereus (ATCC 14579, NC_004722). The three bacteria are 
close to the point of being considered to be different lineages of a single B. cereus species 
m- The results provide a strong support that the B. thuringiensis strain is closer to 
the B.anthraeis strain than to the B. cereus strain, which agrees with phylogenies re¬ 
ported in the literature [T]. Indeed, the B.thuringiensis strain and the B.anthraeis strain 
have a much higher pairwise identity rate than the former has with the B. cereus strain 
(estimated DNA-DNA hybridization distance 81% vs. 45%, as computed by GGDC 
0 )- 

However, for species with low sequence similarity, a large majority of reads may 
have no hits to either genome, and only a small fraction of reads may reveal a significant 
difference in distance. This situation is illustrated in Table [^showing the results of classi¬ 
fication of B.lieheniformis (ATCC 14580, NC_006270) reads against B.anthraeis (Ames, 
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Table 4: Classification of Bacillus licheniformis reads against Bacillus anthracis and 
Bacillus pumilus genomes. Each entry contains a pair “Fraction (in %) of reads classified 
closer to B.anthracis /Fraction of reads classified closer to B.pumilus 



weight 


14 

16 

18 

20 

22 

contig hit nb 

47/40 

24/25 

04/07 

0.9/3.8 

0.3/2.4 

contig cover 

49/46 

24/25 

04/07 

0.9/3.8 

0.3/2.4 

contig Jaccard 

37/62 

24/30 

04/08 

0.8/4.0 

0.3/2.6 

spaced hit nb 

49/33 

22/27 

04/10 

1.0/5.4 

0.4/3.3 

spaced cover 

53/39 

23/27 

04/10 

1.0/5.4 

0.5/3.3 

spaced Jaccard 

41/59 

22/35 

04/10 

0.9/4.6 

0.4/3.5 


NC_003997) and B.pumilus genomes (SAFR-032, NC_009848). The results support a 
higher similarity of B. licheniformis to B.pumilus than to B. anthracis, but the difference 
is revealed on a very small fraction of reads. The conclusion, however, is signihcant as 
those reads represent the majority of reads having any hits to one of the genomes. As 
with the previous example, this result is conhrmed by previously reported phytogenies 


ra¬ 
in all our experiments reported in this section, spaced seeds showed a better classi¬ 
fication capacity. The difference is especially signihcant in “nontrivial” cases involving 
relatively dissimilar genomes, such as those illustrated by Tables and While the 
difference between hit number and coverage estimators appeared insignihcant (in agree¬ 
ment with results of Section 3.1.1), the Jaccard index generally provides a more distinct 
discrimination and, combined with a spaced seed, appears to be the best estimator. 


3.2 Correlation of counts with alignment quality 

In metagenomic projects, reads to be classihed do not necessarily come from genomes 
stored in the database, but can come from genomes of other species. These species can be 
genetic variants of species of the database, such as different strains of the same bacteria, 
but can also come from organisms represented in the database only at the rank of genus 
or family, or may even have no representatives at all at low taxonomic ranks. Therefore, 
an accurate mapping of a read to a corresponding clade requires not just assigning it to 
the appropriate sampled genome, but estimating its distances to each of the genomes in 
order to locate its position within the whole taxonomic tree. 

With this motivation, we turned to the question how well the measured counts cor¬ 
relate with the alignment score. For a hxed minimal identity rate pid, we randomly 
sampled gapless alignments of length 100 with identity rate from interval [pid--l], and 
collected pairs (number of mismatches, count), where, as before, ’count’ stands for ei¬ 
ther the number of hits or the coverage of a given seed. For these data, we computed 
Spearman’s rank correlation. 

Results are presented on plots in Figure They show that when the identity rate 
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Figure 1: Spearman’s rank correlation between score and counts, depending on the 
minimal identity rate. 
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of alignments takes a large range of values (minimal id rate smaller than ~ 0.9), spaced 
seeds yield a significantly higher correlation than contiguous seeds, for both hit-number 
and coverage counts. Furthermore, the coverage count slightly outperforms the hit- 
number count, especially for spaced seeds and larger weights. 

For high-similarity alignments, however, the picture changes: the coverage count 
loses its performance, with its correlation value sharply decreasing. Furthermore, the 
correlation of hit-number goes down as well for spaced seeds, while it continues to grow 
for contiguous seeds ending up by reaching and even slightly outperforming the one for 
spaced seed. This is due to a larger span of spaced seeds and to their combinatorial 
properties that cause the hit number values to be less sharply concentrated at certain 
values, and therefore to be less well correlated with the number of mismatches. 

In conclusion, while spaced seeds provide a much better estimator for alignments 
whose quality ranges over a large interval, for high-quality alignments (> 90% of iden¬ 
tity), the hit number of contiguous seed becomes a better estimator. The superiority 
of hit-number over coverage for high-quality alignments has also been reported in [32] . 
Along with Spearman’s correlation, we also made an analysis of mutual information 
computed on the same data (data not shown) that confirmed the above conclusions. 


3.3 Correlation on real genomes 


To validate the conclusions of the previous section in a real-life metagenomics framework 
and to analyse more closely how well different counts for a read correlate with the best 
alignment of the read to a real genomic sequence, we implemented another series of 
experiments. 

Given a genome G, we generated a set of Illumina-like single-end reads by selecting 
random substrings of G of length L {L = 100) and introducing k mismatch errors, with 
k drawn randomly between 1 and 20 for every read. For each read, we computed the 
counts - hit number and coverage - with respect to genome G under a given seed, similar 
to Section [3. 1.2 Collected data have then been analysed. 

This experimental setup has been applied to Mycobacterium tuberculosis genome, a 
typical result (seed weight 20) is shown in Figure Each plot shows the density of 
reads for each pair (number of mismatches, count), depending on the seed (contiguous 
or spaced) and count (hit-number or coverage). Spearman’s and Pearson’s correlation 
coefficients are shown for each plotted dataset. 

The plots clearly illustrate the advantage of spaced seeds over contiguous seeds for 
estimating the alignment quality. Plots for contiguous seeds are more blurred whereas 
plots for spaced seeds demonstrate a better correlation between the two values. This 
is confirmed by the absolute values of Spearman’s rank correlation coefficient that are 
significantly higher for spaced seeds, indicating a better statistical dependence. This is 
further illustrated in Figure]^ which shows the same data through the average curve and 
95% confidence band. It confirms that spaced seeds produce a dependence with lower 
deviation from the mean, compared to contiguous seeds. 

Comparing hit-number and coverage estimators, we observe that coverage yields a 
slightly better Spearman correlation and a significantly better Pearson correlation, due 
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mismatches in randomly generated reads. Seed is shown above the plot, and Spearman’s 
and Pearson’s correlations are shown below. Grayscale shows the density of reads. 
Experiments made on M. tuberculosis genome. 
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to a convex shape of the dependence, compared to the more straight dependence for the 
coverage. 

This analysis has been done for several other bacterial genomes, producing similar 
results. Plots for other genomes and other seed weights can be found at https: // 
github.com/gregorykucherov/spaced-seeds-for-metagenomics, 


3.4 Large-scale experiments 

In order to validate the advantage of spaced seeds in large-scale metagenomic projects, 
we modified Kraken software [H] to make it work with spaced seeds rather than with 
contiguous seeds only. The limitation of this comparison is that it only allows estimating 
the effect of using spaced seeds combined with the Kraken algorithm, and within its 
implementation. On the other hand, this procedure allows us to estimate the effect of 
spaced seeds in an unbiased manner, by keeping unchanged all other factors that might 
influence the results. 

Our extended implementation, that we call seed-Kraken, allows the user to specify 
a spaced seed as a parameter. For a set of genomes, a database of spaced fc-mers matching 
the seed is constructed, which is later used to classify reads through the original Kraken 
algorithm. Since Kraken uses fe-mer counting Jellyfish program m as the fe-mer 
indexing engine, we had also to modify Jellyfish to allow it to deal with spaced fc-mers. 

Integration of spaced seed into Kraken required a minor modification of the way 
Kraken deals with complementary sequences. In Kraken, complementary A:-mers have 
a single representative in the index, the lexicographical smallest of the two. With spaced 
seeds, dealing with complementary sequencing is more delicate, as the complement of 
a spaced A:-mer does not match the same seed but its inverse. To cope with this, we 
modified Kraken to index each distinct A:-mer. We then processed each read in direct 
and complementary directions separately and select the one which produced more hits. 
Compared to original Kraken, this procedure takes more index space (additional ~ 5% 
in practice) and doubles A:-mer query time. 

We compared the performance of Kraken and seed-Kraken on several datasets. 
First, we performed experiments with three simulated metagenomes HiSeq, MiSeq, and 
simBA-5 introduced in the original work [U], each containing 10,000 sequences. Further¬ 
more, we created a dataset from Human Microbiome Project data by randomly selecting 
50,000 sequences from SRS011086 Tongue dorsum metagenomic sample 
(http://hmpdacc.org/HMSCP/). Here we only report on results for MiSeq and HMP- 
tongue datasets and refer to the supplementary material for a complete account includ¬ 
ing results for HiSeq and simBA-5. MiSeq is a merge of Illumina reads of 10 bacterial 
genomes, and HMPtongue is a random sample of real Illumina whole-metagenome 
sequences. 

Due to resource limitations, the database we used in MiSeq experiments was half of 
the size of the Kraken’s default database (which requires 70GB of RAM). Our database 
was obtained by choosing a single representative strain of each bacteria species, except 
for the species from HiSeq and MiSeq metagenomes for which all strains were included. 
Overall, this represented 915 genomes of total size 3.3GB. For HMPtongue dataset, this 
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Figure 4: Sensitivity/precision of seed-Kraken (circle points) and original Kraken 
(cross points) for HMPtongue and MiSeq datasets and three taxonomic levels: species, 
genus and family. Triangle points correspond to SEED-Kraken run on contiguous seed 
of weight 24 and 31, plotted to highlight the effect of the change in the assignment 
algorithm. 
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database was extended with a subset of HMP reference library, 0.8GB in total, including 
references for the selected 50,000 sequences. 

For each metagenomic dataset, we measured the sensitivity (percentage of correctly 
classified reads out of all reads) and precision (percentage of correct classifications out 
of all classifications) of Kraken and seed-Kraken at three taxonomic levels: species, 
genus and family. In each case, this has been done with seeds of different weights between 
20 and 31, and for each weight, seed-Kraken has been run on a few different spaced 
seeds (see Conclusions). 

Figure shows sensitivity-precision ROC-curves (Receiver Operating Characteristic) 
for seed-Kraken, and for the unmodified Kraken. In the case of seed-Kraken, 
the “best performing” seed is charted for each weight. Furthermore, triangle points 
correspond to seed-Kraken run on contiguous seeds, plotted in order to measure the 
effect of our modification in dealing with complementary sequences. 

At the levels of genus and family, spaced seeds consistently show a better sensitivity- 
precision trade-off, with the sensitivity increase of about 2 percentage points for MiSeq 
and 3-5 points for HMPtongue, for a given precision rate. The results of seed-Kraken 
with contiguous seeds (triangle points) confirm that this improvement is due to the 
use of spaced seeds and not to our slight modification of the assignment algorithm 
due to complementary sequences. For small weights (20-22), a spaced seed achieves 
simultaneously a better sensitivity and a better precision than the contiguous seed of 
the same weight. When the weight grows, the increment in precision disappears reaching 
the level of the contiguous counterpart, or sometimes coming down below it. However, 
this is largely compensated by the increase in sensitivity. 

For the species level, the picture turns out to be more involved. Here we observe 
that due to the small modification of the assignment algorithm, seed-Kraken run with 
contiguous seeds (triangle points) shows a modified behavior compared to Kraken. 
Specifically, we observe a drop in precision and a gain in sensitivity, and those are dif¬ 
ferent for MiSeq and HMPtongue datasets. The reason for this is that seed-Kraken 
makes more species-level classifications than Kraken but at the same time, makes 
more inaccurate assignments to a closely related organism (typically, different strain 
of the same bacteria), which eventually leads to a lower precision. This phenomenon 
has a bigger impact for rich databases (HMPtongue experiment) compared to “sparse” 
databases where each species is represented by few organisms (MiSeq). As for the con¬ 
tribution of spaced seeds, we observe an improved sensitivity-precision trade-off here 
as well. Compared to seed-Kraken applied to contiguous seeds, this improvement is 
small for MiSeq but significant for HMPtongue, which shows a correction capacity of 
spaced seeds w.r.t. erroneous assignments to close strains. Compared to the original 
Kraken, we obtain a sensitivity increment of about 1% which becomes smaller (MiSeq) 
or completely disappears (HMPtongue) when the seed weight grows. 

As mentioned earlier, the spaced seed corresponding to each plotted seed-Kraken 
point has been selected out of a few (usually two to four) seeds tried. The full list of seeds 
applied in experiments and corresponding results can be found in the supplementary 
material. Here we just mention that for large weights (24 and more) the span of the seed 
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becomes an important factor, with seeds of large span showing a drop in sensitivity and 
best seeds being those with relatively few jokers. 

Building a seed-Kraken limited database takes approximately 1 hour on a server 
with 20 CPU cores, and the resulting size is 26 GB for seed of weight 24, which compares 
to the 25GB for original Kraken. The classification running times are longer than for 
original Kraken by a factor of 3 to 5. 


4 Conclusions 


Through a series of computational experiments, we showed that spaced seeds signifi¬ 
cantly improve the accuracy of metagenomic classihcation of short NGS reads. The 
superiority of spaced seeds for different variants of alignment-free sequence comparison 
has been recently demonstrated by other authors as well I331II21EI1132]. In this work, 
we specifically focused on the metagenomics setting characterized by very large volumes 
of data, both in terms of the number of reads and the size of genomic database. This 
quantity of data precludes using some alignment-free comparison techniques, and leaves 
room only for highly time- and space-efficient approaches. Note also that in our setting, 
we have to compare short sequences (reads) with long ones (whole genomes), which 
makes an important difference with problems considered in 12IIE21EH]. For example, in 
the framework of metagenomic classification, it is hardly conceivable taking into account 
/c-mer frequencies |21j . as this information would be computationally difficult to utilize. 

Another improvement considered in mm [29] is to use multiple seeds, i.e. sev¬ 
eral seeds simultaneously instead of a single one. This extension is known to bring an 
advantage in seed-and-extend sequence alignment [38l|23], and mm show that this 
improvement applies to alignment-free comparison as well. However, each seed requires 
to build a separate index for database genomic sequences, and therefore it appears com¬ 
putationally difficult to use multiple seeds in metagenomics, unless some new indexing 
techniques are designed for this purpose. 

In our work, we studied three estimators: hit number, coverage and Jaccard index. 
Hit number and coverage behave similarly in classihcation (Section |3.1[), but Jaccard 


index generally improves on them in the case of mapping to real genomes (Section 3.1.2), 
due to the correction w.r.t. the A:-mer-richness. Gonsidered as an estimator of alignment 
quality (Section |3.2[ |3.3[ ), coverage provides a certain advantage over hit number. More 
subtle estimators can be considered as well, e.g. by taking into account the position of 
fe-mer in the read (rehecting the sequencing error rate), and this provides an issue for 
further study. 

Designing efficient seeds for metagenomic classihcation is another important issue 
that goes beyond the present study. Note that optimal spaced seeds for seed-and-extend 
alignment are generally not optimal for alignment-free fc-mer-based comparison [32] . 
In |32j . the authors designed (sub-)optimal seeds maximizing the Pearson correlation 
between hit-number/coverage count and the alignment quality. Their solution is imple¬ 
mented in Iedera softwar^ (http://bioinfo.lih.fr/yass/iedera) [19]. On the other hand, 

^the latest version of Iedera performs design for Spearman correlation as well 
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recent work m introduces quadratic residue seeds (QR-seeds) for seed-and-extend align¬ 
ment, which present a good performance and have the advantage of easy design, avoid¬ 
ing the computationally expensive enumeration of Iedera. In our work, we used both 
Iedera and QR-seeds adapted to our setting. We observed that in most cases, Iedera 
seeds are superior (being designed specifically for our task) but in a few cases, QR-seeds 
demonstrated equal or even better performanc^ This may be due to their large span 
(cf. supplement material) for which applying Iedera is computationally costly. 

We now summarize the main contributions of our work. 

• We showed that spaced seeds can drastically improve the success rate of binary 
classification of alignments into two categories, each defined by a specific mismatch 
rate. Here the classification is done through “querying” an alignment using a seed 
as a mask and reporting whether the seed applies at a given position. For example, 
in discriminating between alignments of length 100 with mismatch rate 0.2 and 0.3, 
a spaced seed of weight 16 achieves 63% of success while a contiguous seed of the 
same weight achieves only 40% (Table [I^. 

Recently, spaced seeds have been shown to define more efficient kernels for SVM 
classification of both protein [33] and nucleotide (RNA) sequences [32]. Compared 
to these works, here we demonstrate the superiority of spaced seeds in a very simple 
classification setting, where sequences have to be classified according to identity 
rate, without a training stage and without resorting to SVM machinery. 

• We showed experimentally that spaced seeds allow for a better classification of NGS 
reads coming from a genome G between two other genomes Gi and G 2 of the same 
genus. Here reads are classified according to the phylogenetic distance between G 
and Gi and G and G 2 respectively. We established that in this task, Jaccard index 
provides an advantage over hit-number and coverage which is especially important 
for seeds of small weight. 

• We studied how well different estimators (coverage/hit-number combined with 
spaced/contiguous seed) correlate with the alignment quality, by measuring Spear¬ 
man’s rank correlation coefficient and mutual information coefficient. Here again, 
we observed a significantly better correlation produced by spaced seeds, but only 
when the alignment quality varies over a sufficiently large range, starting from 
identity rate around 0.9 or below. On the other hand, if only high-quality align¬ 
ment are targeted (id rate at least 0.9) then the correlation produced by spaced 
seeds becomes lower, with hit-number measure over a contiguous seed eventually 
becoming the best for alignments of id rate about 0.95 or more. 

• We also measured the correlation produced on real genomes within the metage- 

nomic classification approach of For this, we assumed that the “closeness” 

of a read to a genome is characterized by the quality of the alignment (in our case, 

^e.g. best results of Tablej^for weights 14-18 were obtained with QR-seeds 
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the number of mismatches), and computed the correlation produced by different 
counts on simulated reads. These experiments confirmed the superiority of spaced 
seeds as well. Moreover, they showed that coverage combined with spaced seeds 
provides the best option, leading to the highest Spearman’s correlation but also 
to a significantly higher Pearson’s correlation. The latter means that this esti¬ 
mator induces a dependency closer to linear, which can be a useful feature for 
classification algorithms. 


• Finally, we compared spaced and contiguous seeds through large-scale metage¬ 
nomics experiments. We modified Kraken software [Uj to make it work with 
spaced seeds, without modifying the core classification algorithm or any other 
parts of the software, with the only exception being the way the complementary 
sequences are dealt with. With just replacing contiguous seeds by spaced seeds, 
Kraken showed a consistent improvement of specificity/sensitivity trade-off at 
genus and family levels and a dataset-dependent improvement at the species level. 


Real data experiments of Sections 3.1.2, 3.3 have been done using SnakeMake 


Overall, all our experiments corroborate the thesis of better performance of spaced 
seeds for metagenomic classification. Many further questions are raised by this work. 
Our results remain to be explained with more rigorous probabilistic arguments, similarly 
to how it has been done for spaced seeds applied to seed-and-extend paradigm |l5]. While 
there are obvious similarities between the two applications, the underlying “mechanisms” 
seem to be different. One sign of this difference is that optimal spaced seeds for the two 
problems are not the same, as mentioned earlier. 

Experiments with Kraken (Section |3.4[ ) give a strong evidence that spaced seeds 
can improve the classification accuracy in real-life large-scale metagenomic projects. One 
further improvement would be to implement coverage and Jaccard measures that showed, 
in general, a better performance compared to the hit number. Introducing spaced seeds 
rises new issues, such as the construction of an efficient index of the database, or adapting 
the algorithm of computing the most likely node of the taxonomic tree from counts 
produced by individual genomes, i.e. leaves of the tree. These questions are a subject 
for future work. 
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